ABSTRACT

This paper discusses the relationship between the sequential hard c-means (SHCM), learning
vector quantization (LVQ), and fuzzy c-means (FCM) clustering algorithms. LVQ and SHCM
suffer from several major problems. For example, they depend heavily on initialization. If the
initial values of the cluster centers are outside the convex hull of the input data, such
algorithms, even if they terminate, may not produce meaningful results in terms of prototypes
for cluster representation. This is due in part to the fact that they update only the winning
prototype for every input vector. We also discuss the impact and interaction of these two
families with Kohonen's self-organizing feature mapping (SOFM), which is not a clustering
method, but which often lends ideas to clustering algorithms. Then we present two
generalizations of LVQ that are explicitly designed as clustering algorithms; we refer to these
algorithms as generalized LVQ (GLVQ) and fuzzy LVQ (FLVQ). Learning rules are derived to
optimize an objective function whose goal is to produce "good clusters". GLVQ/FLVQ (may)
update every node in the clustering net for each input vector. Neither GLVQ nor FLVQ depends
upon a choice for the update neighborhood or learning rate distribution - these are taken care
of automatically. Segmentation of a gray tone image is used as a typical application of these
algorithms to illustrate the performance of GLVQ/FLVQ.

1. INTRODUCTION: LABEL VECTORS AND CLUSTERING


Clustering algorithms attempt to organize unlabeled feature vectors into clusters or "natural
groups" such that points within a cluster are more similar to each other than to vectors
belonging to different clusters. Treatments of many classical approaches to this problem
include the texts by Kohonen 1, Bezdek 2, Duda and Hart 3, Tou and Gonzalez 4, Hartigan 5, and
Dubes and Jain 6. Kohonen's work has become timely in recent years because of the widespread
resurgence of interest in the theory and applications of neural network structures 1.

Label Vectors. To characterize solution spaces for clustering and classifier design, let c denote
the number of clusters, 1 < c < n, and set:

$$ N_{fcu} = \{ y \in \Re^c \mid y_k \in [0, 1] \ \forall k \} = \text{(unconstrained) fuzzy labels}; \tag{1a} $$

$$ N_{fc} = \{ y \in N_{fcu} \mid \textstyle\sum_k y_k = 1 \} = \text{(constrained) fuzzy labels}; \tag{1b} $$

$$ N_c = \{ y \in N_{fc} \mid y_k \in \{0, 1\} \ \forall k \} = \text{hard labels for c classes}. \tag{1c} $$

$N_c$ is the canonical basis of Euclidean c-space; $N_{fc}$ is its convex hull; and $N_{fcu}$ is the unit
hypercube in $\Re^c$. Figure 1 depicts these sets for c = 3. For example, the vector $y = (.1, .6, .3)^T$ is a
typical constrained fuzzy label vector; its entries lie between 0 and 1, and sum to 1. And because
its entries sum to 1, y may also be interpreted as a probabilistic label. The cube $N_{fcu} = [0, 1]^3$ is
called unconstrained fuzzy label vector space; vectors such as $z = (.7, .2, .7)^T$ have each entry
between 0 and 1, but are otherwise unrestricted.
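
The three label sets in (1a)-(1c) are nested, so a given vector can be assigned to the smallest set that contains it. The following is a minimal sketch of that test, not part of the original paper; the function name `label_type` and the tolerance `tol` are illustrative assumptions.

```python
import numpy as np

def label_type(y, tol=1e-9):
    """Return the smallest label set from (1a)-(1c) that contains y."""
    y = np.asarray(y, dtype=float)
    # Every label vector must lie in the unit hypercube N_fcu.
    if np.any(y < -tol) or np.any(y > 1 + tol):
        return "not a label vector"
    # Without the sum-to-one constraint, y is only an unconstrained fuzzy label.
    if abs(y.sum() - 1.0) > tol:
        return "unconstrained fuzzy"
    # Entries are all 0 or 1 and sum to 1: a vertex of the simplex (hard label).
    if np.all(np.isclose(y, 0.0, atol=tol) | np.isclose(y, 1.0, atol=tol)):
        return "hard"
    return "constrained fuzzy"

print(label_type([0.1, 0.6, 0.3]))   # the vector y from the text
print(label_type([0.7, 0.2, 0.7]))   # the vector z from the text
print(label_type([0, 1, 0]))
```

The first two calls use the example vectors y and z given above; the third is a canonical basis vector.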

Cluster Analysis. Given unlabeled data $X = \{x_1, x_2, \ldots, x_n\}$ in $\Re^p$, clustering in X is assignment
of (hard or fuzzy) label vectors to the objects generating X. If the labels are hard, we hope that
they identify c "natural subgroups" in X. Clustering is also called unsupervised learning, the
word learning referring here to learning the correct labels (and possibly vector prototypes or
quantizers) for "good" subgroups in the data. c-partitions of X are characterized as sets of (cn)
values $\{u_{ik}\}$ satisfying some or all of the following conditions:


$$ 0 \le u_{ik} \le 1 \quad \forall\, i, k; \tag{2a} $$

$$ 0 < \sum_{k=1}^{n} u_{ik} < n \quad \forall\, i; \tag{2b} $$

$$ \sum_{i=1}^{c} u_{ik} = 1 \quad \forall\, k. \tag{2c} $$



Fig. 1. Hard, fuzzy and probabilistic label vectors (for c = 3 classes). 



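Using equations (2) with the values $\{u_{ik}\}$ arrayed as a (c×n) matrix $U = [u_{ik}]$, we define:

$$ M_{fcnu} = \{ U \in \Re^{cn} \mid u_{ik} \text{ satisfies (2a) and (2b)} \ \forall\, i, k \}; \tag{3a} $$

$$ M_{fcn} = \{ U \in M_{fcnu} \mid u_{ik} \text{ satisfies (2c)} \ \forall\, i \text{ and } k \}; \tag{3b} $$

$$ M_{cn} = \{ U \in M_{fcn} \mid u_{ik} = 0 \text{ or } 1 \ \forall\, i \text{ and } k \}. \tag{3c} $$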

Equations (3a), (3b) and (3c) define, respectively, the sets of unconstrained fuzzy, constrained
fuzzy (or probabilistic), and crisp c-partitions of X. We represent clustering algorithms as
mappings $A : X \to M_{fcnu}$. Each column of U in $M_{fcnu}$ ($M_{fcn}$, $M_{cn}$) is a label vector from $N_{fcu}$
($N_{fc}$, $N_c$). The reason these matrices are called partitions follows from the interpretation of
$u_{ik}$ as the membership of $x_k$ in the i-th partitioning subset (cluster) of X. $M_{fcnu}$ and $M_{fcn}$ can
be more realistic physical models than $M_{cn}$, for it is common experience that the boundaries
between many classes of real objects (e.g., tissue types in magnetic resonance images) are in
fact very badly delineated (i.e., really fuzzy), so $M_{fcnu}$ provides a much richer means for
representing and manipulating data that have such structures. We give an example to illustrate
hard and fuzzy c-partitions of X. Let $X = \{x_1, x_2, x_3\}$ = {peach, plum, nectarine}, and let c = 2.
Typical 2-partitions of these three objects are shown in Table 1:



Table 1. 2-partitions of $X = \{x_1, x_2, x_3\}$ = {peach, plum, nectarine}

              Hard U_1 in M_23      Fuzzy U_2 in M_f23     Fuzzy U_3 in M_f23u
Object        x_1   x_2   x_3       x_1   x_2   x_3        x_1   x_2   x_3
Peaches       1     0     0         0.9   0.2   0.4        0.9   0.5   0.5
Plums         0     1     1         0.1   0.8   0.6        0.6   0.8   0.7


The nectarine, $x_3$, is shown as the last column of each partition, and in the hard case, it must
be (erroneously) given full membership in one of the two crisp subsets partitioning this data; in
$U_1$, $x_3$ is labeled "plum". Fuzzy partitions enable algorithms to (sometimes!) avoid such
mistakes. The final column of the first fuzzy partition in Table 1 allocates most (0.6) of the
membership of $x_3$ to the plums class, but also assigns a lesser membership of 0.4 to $x_3$ as a
peach. The last partition in Table 1 illustrates an unconstrained set of membership
assignments for the objects in each class. Columns like the one for the nectarine in the two
fuzzy partitions serve a useful purpose - lack of strong membership in a single class is a signal
to "take a second look". Hard partitions of data cannot suggest this. In the present case, the
nectarine is a hybrid of peaches and plums, and the memberships shown for it in the last
column of either fuzzy partition seem more plausible physically than crisp assignment of $x_3$ to
an incorrect class. It is appropriate to note that statistical clustering algorithms - e.g.,
unsupervised learning with maximum likelihood - also produce solutions in $M_{fcn}$. Fuzzy
clustering began with Ruspini 8; see Bezdek and Pal 9 for a number of more recent papers on this
topic. Algorithms that produce unconstrained fuzzy partitions of X are relatively new; for
example, see the work of Krishnapuram and Keller 10.
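
Conditions (2a)-(2c) can be checked mechanically for any candidate matrix U. The sketch below is an illustration rather than part of the original paper. The function name `partition_type` and the tolerance are assumptions, and the hard matrix `U1` is written out from the text's description (peach and plum labeled correctly, nectarine labeled "plum").

```python
import numpy as np

def partition_type(U, tol=1e-9):
    """Classify a (c x n) matrix as a hard, constrained or unconstrained fuzzy c-partition."""
    U = np.asarray(U, dtype=float)
    c, n = U.shape
    in_unit = np.all((U >= -tol) & (U <= 1 + tol))              # condition (2a)
    row_sums = U.sum(axis=1)
    rows_ok = np.all((row_sums > tol) & (row_sums < n - tol))   # condition (2b)
    if not (in_unit and rows_ok):
        return "not a c-partition"
    if not np.allclose(U.sum(axis=0), 1.0, atol=tol):           # condition (2c) fails
        return "unconstrained fuzzy"
    if np.all(np.isclose(U, 0.0) | np.isclose(U, 1.0)):         # all entries 0 or 1
        return "hard"
    return "constrained fuzzy"

U1 = [[1, 0, 0], [0, 1, 1]]               # hard partition (assumed layout, see lead-in)
U2 = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.6]]   # first fuzzy partition of Table 1
U3 = [[0.9, 0.5, 0.5], [0.6, 0.8, 0.7]]   # second fuzzy partition of Table 1

for name, U in (("U1", U1), ("U2", U2), ("U3", U3)):
    print(name, partition_type(U))
```

The columns of `U3` sum to 1.5, 1.3 and 1.2, which is why it satisfies (2a) and (2b) but not (2c).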

Prototype classification is illustrated in Figure 2. Basically, the vector $v_i$ is taken as a
prototypical representation for all the vectors in the hard cluster $X_i \subset X$. There are many
synonyms for the word prototype in the literature: for example, quantizer (hence LVQ),
signature, template, paradigm, exemplar. In the context of clustering, of course, we view $v_i$ as
the cluster center of hard cluster $X_i \subset X$. Each of the clustering algorithms discussed in this
paper will produce a set of c prototype vectors $V = \{v_k\}$ from any unlabeled or labeled input data
set X in $\Re^p$. Once the prototypes are found (and possibly relabeled if the data have physical
labels), they define a hard nearest prototype (NP) classifier, say $D_{NP,V}$:

Crisp Nearest Prototype (1-NP) Classifier. Given prototypes $V = \{v_k \mid 1 \le k \le c\}$ and $z \in \Re^p$:

$$ \text{Decide } z \in i \iff D_{NP,V}(z) = e_i \iff \|z - v_i\|_A \le \|z - v_j\|_A, \quad 1 \le j \le c, \ j \ne i. \tag{4} $$

In (4) A is any positive definite p×p weight matrix - it renders the norm in (4) an inner product
norm. That is, the distance from z to any $v_i$ is computed as $\|z - v_i\|_A = \sqrt{(z - v_i)^T A (z - v_i)}$.

Equation (4) defines a hard classifier, even though its parameters may come from a fuzzy
algorithm. It would be careless to call $D_{NP,V}$ a fuzzy classifier just because fuzzy c-means
produced the prototypes, for example, because (4) can be implemented, and has the same
geometric structure, using prototypes $\{v_k\}$ from any algorithm that produces them. The $\{v_k\}$
can be sample means of hard clusters (HCM); cluster centers of fuzzy clusters (FCM); weight
vectors attached to the nodes in the competitive layer of a Kohonen clustering network (LVQ);
or estimates of the (c) assumed mean vectors $\{\mu_k\}$ in maximum likelihood decomposition of
mixtures.
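
The 1-NP rule in (4) needs only the prototypes and a weight matrix A. A minimal sketch follows, assuming prototypes stored as the rows of an array; the function name `nearest_prototype` and the tie-breaking choice (lowest index wins) are assumptions of this illustration, not part of the paper.

```python
import numpy as np

def nearest_prototype(z, V, A=None):
    """Index i of the prototype v_i closest to z in the norm induced by A."""
    z = np.asarray(z, dtype=float)
    V = np.asarray(V, dtype=float)            # shape (c, p), one prototype per row
    if A is None:
        A = np.eye(V.shape[1])                # Euclidean case, A = I
    diff = z - V                              # row i holds z - v_i
    # Squared A-norm distances (z - v_i)^T A (z - v_i) for every i at once.
    d2 = np.einsum("ip,pq,iq->i", diff, A, diff)
    return int(np.argmin(d2))                 # ties resolved in favor of the lowest index

V = [[0.0, 0.0], [4.0, 0.0]]
print(nearest_prototype([1.0, 3.0], V))                          # Euclidean norm
print(nearest_prototype([3.0, 1.0], V, A=np.diag([1.0, 4.0])))   # weighted norm
```

Because the square root is monotone, comparing squared distances gives the same winner as comparing the norms in (4).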

Figure 2. Representation of many vectors by one prototype (vector quantizer).



The geometry of the 1-NP classifier is shown in Figure 3, using Euclidean distance for (4) - that
is, A = I, the p×p identity matrix. The 1-NP design erects a linear boundary halfway between and
orthogonal to the line connecting the i-th and j-th prototypes, viz., the hyperplane HP through
the midpoint $(v_i + v_j)/2$, perpendicular to $v_i - v_j$. All NP designs defined with inner product
norms use (piecewise) linear decision boundaries of this kind.
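
As a short worked check of why the Euclidean 1-NP boundary is a hyperplane through the midpoint of the two prototypes, the comparison in (4) with A = I can be expanded directly. This derivation is added for illustration and uses only the symbols z, $v_i$ and $v_j$ from the text.

```latex
\begin{aligned}
\|z - v_i\|^2 \le \|z - v_j\|^2
  &\iff -2 v_i^T z + \|v_i\|^2 \le -2 v_j^T z + \|v_j\|^2 \\
  &\iff 2 (v_j - v_i)^T z \le \|v_j\|^2 - \|v_i\|^2 \\
  &\iff (v_j - v_i)^T \left( z - \tfrac{1}{2}(v_i + v_j) \right) \le 0 ,
\end{aligned}
```

The last step uses $(v_j - v_i)^T (v_i + v_j) = \|v_j\|^2 - \|v_i\|^2$. Equality holds exactly on the hyperplane with normal $v_j - v_i$ that passes through $(v_i + v_j)/2$.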
Figure 3. Geometry of the Nearest Prototype Classifier for Inner Product Norms


Clustering algorithms imaged in $M_{fcnu}$ eventually "defuzzify" or "deprobabilize" their label
vectors, usually using the maximum membership (or maximum probability) strategy on the
terminal fuzzy (or probabilistic) c-partitions produced by the data:

Maximum membership (MM) conversion of U in $M_{fcnu}$ to $U_{MM}$ in $M_{cn}$:

$$ u_{MM,ik} = \begin{cases} 1; & u_{ik} \ge u_{sk}, \ 1 \le s \le c, \ s \ne i \\ 0; & \text{otherwise} \end{cases} \qquad 1 \le i \le c; \ 1 \le k \le n. \tag{5} $$

$U_{MM}$ is always a hard c-partition; we use this conversion to generate a confusion matrix and
error statistics when processing labeled data with FCM and FLVQ. For HCM/FCM/LVQ/FLVQ,
using (5) instead of (4) with the terminal prototypes secured is fully equivalent - that is,
$U_{MM}$ is the hard partition that would be created by applying (4) with the final cluster centers to
the unlabeled data. This is not true for GLVQ.
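
The maximum membership conversion in (5) amounts to a column-wise argmax. The sketch below is illustrative only. The function name `max_membership` is an assumption, and ties are broken in favor of the first maximal row so that the result is always a hard partition.

```python
import numpy as np

def max_membership(U):
    """Convert a fuzzy (c x n) partition to the hard partition U_MM of (5)."""
    U = np.asarray(U, dtype=float)
    winners = np.argmax(U, axis=0)                 # first maximal row in each column
    U_mm = np.zeros_like(U)
    U_mm[winners, np.arange(U.shape[1])] = 1.0     # a single 1 per column
    return U_mm

U2 = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.6]]            # first fuzzy partition of Table 1
print(max_membership(U2))
```

Applied to the first fuzzy partition of Table 1, the nectarine column (0.4, 0.6) is hardened to the plum class.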


2. LEARNING VECTOR QUANTIZATION AND SEQUENTIAL HARD C-MEANS

Kohonen's name is associated with two very different, widely studied and often confused
families of algorithms. Specifically, Kohonen initiated study of the prototype generation
algorithm called learning vector quantization (LVQ); and he also introduced the concept of
self-organizing feature maps (SOFM) for visual display of certain one and two dimensional
data sets 1. LVQ is not a clustering algorithm per se; rather, it can be used to generate crisp
(conventional or hard) c-partitions of unlabeled data sets in $\Re^p$ using the 1-NP classifier designed
with its terminal prototypes. LVQ is applicable to p dimensional unlabeled data. SOFM, on the
other hand, attempts to find topological structure hidden in data and display it in one or two
dimensions.


We shall review LVQ and its c-means relative carefully, and SOFM in sufficient detail to
understand its intervention in the development of generalized network clustering algorithms.
The primary goal of LVQ is representation of many points by a few prototypes; identification
of clusters is implicit, but not active, in pursuit of this goal. We let $X = \{x_1, x_2, \ldots, x_n\} \subset \Re^p$ denote
the samples at hand, and use c to denote the number of nodes (and clusters in X) in the
competitive layer.


The salient features of the LVQ model are contained in Figure 5. The input layer of an LVQ
network is connected directly to the output layer. Each node in the output layer has a weight
vector (or prototype) attached to it. The prototypes $V = (v_1, v_2, \ldots, v_c)$ are essentially a network
array of (unknown) cluster centers, $v_i \in \Re^p$ for $1 \le i \le c$. In this context the word learning refers
to finding values for the $\{v_i\}$. When an input vector x is submitted to this network, distances
are computed between each $v_r$ and x. The output nodes "compete", a (minimum distance)
"winner" node, say $v_i$, is found; and it is then updated using one of several update rules.


Figure 5. LVQ Clustering Networks 
We give a brief specification of LVQ as applied to the data in our examples. There are other
versions of LVQ; this one is usually regarded as the "standard" form.


The LVQ Clustering Algorithm 1 


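LVQ1. Given unlabeled data set $X = \{x_1, x_2, \ldots, x_n\} \subset \Re^p$. Fix c, T, and $\varepsilon > 0$.

LVQ2. Initialize $V_0 = (v_{1,0}, \ldots, v_{c,0}) \in \Re^{cp}$, and learning rate $\alpha_0 \in (0, 1)$.

LVQ3. For t = 1, 2, ..., T:

      For k = 1, 2, ..., n:

         a. Find $\|x_k - v_{i,t-1}\| = \min_{1 \le j \le c} \{ \|x_k - v_{j,t-1}\| \}$.                         (6)

         b. Update the winner: $v_{i,t} = v_{i,t-1} + \alpha_t (x_k - v_{i,t-1})$.                          (7)

      Next k.

      d. Apply the 1-NP (nearest prototype) rule to the data:

         $$ u_{LVQ,ik} = \begin{cases} 1; & \|x_k - v_{i,t}\| \le \|x_k - v_{j,t}\|, \ 1 \le j \le c, \ j \ne i \\ 0; & \text{otherwise} \end{cases} \qquad 1 \le i \le c \text{ and } 1 \le k \le n. \tag{8} $$

      e. Compute $E_t = \|V_t - V_{t-1}\|_1 = \sum_{r=1}^{c} \|v_{r,t} - v_{r,t-1}\|_1 = \sum_{r=1}^{c} \sum_{k=1}^{p} |v_{rk,t} - v_{rk,t-1}|$,
         where $v_{rk,t}$ is the k-th component of $v_{r,t}$.

      f. If $E_t \le \varepsilon$ stop; Else adjust learning rate $\alpha_t$.

      Next t.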
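
Steps LVQ1-LVQ3 translate into a short loop. The sketch below is a hedged illustration, not the authors' implementation. It assumes the harmonic schedule $\alpha_t = \alpha_0 / t$ for the "adjust learning rate" step, uses the most recently updated prototypes when finding each winner within a pass, and computes the 1-NP partition (8) once after termination. The name `lvq` and its default parameters are hypothetical.

```python
import numpy as np

def lvq(X, V0, alpha0=0.5, T=100, eps=1e-4):
    """Unsupervised LVQ following steps LVQ1-LVQ3 (learning-rate schedule assumed)."""
    X = np.asarray(X, dtype=float)
    V = np.array(V0, dtype=float)                 # copy, so V0 is left untouched
    for t in range(1, T + 1):
        alpha = alpha0 / t                        # assumed schedule alpha_t = alpha_0 / t
        V_prev = V.copy()
        for x in X:
            i = np.argmin(np.linalg.norm(x - V, axis=1))   # (6): find the winner
            V[i] += alpha * (x - V[i])                     # (7): update only the winner
        if np.abs(V - V_prev).sum() <= eps:       # 1-norm change between passes
            break
    # (8): hard 1-NP partition from the terminal prototypes
    d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)   # shape (c, n)
    U = np.zeros_like(d)
    U[np.argmin(d, axis=0), np.arange(X.shape[0])] = 1.0
    return V, U

X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]   # two well-separated groups
V, U = lvq(X, V0=[[0, 0], [10, 10]])
print(V)
print(U)
```

With this well-separated toy data and one initial prototype inside each group, each prototype is only ever pulled toward points of its own group.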


The numbers $U_{LVQ} = [u_{LVQ,ik}]$ at (8) form a c×n matrix that defines a hard c-partition of X using the
1-NP classifier assignment rule shown in (4). The vector u shown in Figure 1 represents a
crisp label vector that corresponds to one column of this matrix; it contains a 1 in the winner
row i at each k, and zeroes otherwise. Our inclusion of the computation of the hard 1-NP
c-partition of X at the end of each pass through the data (step LVQ3.d) is not part of the LVQ
algorithm - that is, the LVQ iterate sequence does not depend on cycling through U's. Ordinarily
this computation is done once, non-iteratively, outside and after termination of LVQ. Note
that LVQ uses the Euclidean distance in step LVQ3.a. This choice corresponds roughly to the
update rule shown in (7), since $\nabla_v (\|x - v\|^2) = -2I(x - v) = -2(x - v)$. The origin of this rule
comes about by assuming that each $x \in \Re^p$ is distributed according to a probability density
function f(x). LVQ's objective is to find a set of $v_i$'s such that the expected value of the square
of the discretization error is minimized:

$$ E(\|x - v_i\|^2) = \int \cdots \int \|x - v_i\|^2 f(x)\, dx. \tag{9} $$


In this expression $v_i$ is the winning prototype for each x, and will of course vary as x ranges
over $\Re^p$. A sample function of the optimization problem is $e = \|x - v_i\|^2$. An optimal set of $v_i$'s
can be approximated by applying local gradient descent to a finite set of samples drawn from f.
The extant theory for this scheme is contained in Kohonen 12, which states that LVQ converges
in the sense that the prototypes $V_t = (v_{1,t}, v_{2,t}, \ldots, v_{c,t})$ generated by the LVQ iterate sequence
converge, i.e., $\{V_t\} \to \hat{V}$ as $t \to \infty$, provided two conditions are met by the sequence $\{\alpha_t\}$ of
learning rates used in (7):

$$ \sum_{t=0}^{\infty} \alpha_t = \infty; \tag{10a} $$

$$ \sum_{t=0}^{\infty} \alpha_t^2 < \infty. \tag{10b} $$

One choice for the learning rates that satisfies these conditions is the harmonic sequence
$\alpha_t = 1/t$ for $t \ge 1$; $\alpha_0 \in (0, 1)$. Kohonen has shown that (under some assumptions) steepest
descent optimization of the average expected error function (9) is possible, and leads to the
update rule (7). The update scheme shown in equation (7) has the simple geometric
interpretation shown in Figure 6.
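
Returning to the harmonic learning-rate sequence mentioned above, the two conditions in (10) can be verified directly. The following worked check is added for illustration and relies only on standard facts about the harmonic series and the series of inverse squares.

```latex
\sum_{t=1}^{N} \frac{1}{t} \;\ge\; \int_{1}^{N+1} \frac{ds}{s} \;=\; \ln(N+1) \;\to\; \infty
\quad (N \to \infty),
\qquad
\sum_{t=1}^{\infty} \frac{1}{t^{2}} \;=\; \frac{\pi^{2}}{6} \;<\; \infty .
```

The first relation shows that the rates do not shrink so fast that the prototypes stall before reaching a limit. The second shows that the accumulated squared step sizes remain bounded. Adding the single finite term $\alpha_0$ changes neither conclusion.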


Figure 6. Updating the winning LVQ Prototype. 
The winning prototype $v_{i,t-1}$ is simply shifted towards the current data point by moving along
the vector $(x_k - v_{i,t-1})$ which connects it to $x_k$. The amount of shift depends on the value of a
"learning rate" parameter $\alpha_t$, which varies from 0 to 1. As seen in Figure 6, there is no update if
$\alpha_t = 0$, and when $\alpha_t = 1$, $v_{i,t}$ becomes $x_k$ ($v_{i,t}$ is just a convex combination of $x_k$ and $v_{i,t-1}$). This
process continues until termination via LVQ3.f, at which time the terminal prototypes yield a
"best" hard c-partition of X via the 1-NP rule (4).


Comments on LVQ : 


1. Limit point property: Kohonen 12 refers to 13, 14, and mentions that LVQ converges to a
unique limit if and only if conditions (10) are satisfied. However, nothing was said about what
sort or type of points the final weight vectors produced by LVQ are. Since LVQ does not model a
well defined property of clusters (in fact, LVQ does not maintain a partition of the data at all),
the fact that $\{V_t\} \to \hat{V}$ as $t \to \infty$ does not ensure that the limit vector $\hat{V}$ is a good set of prototypes
in the sense of representation of clusters or clustering tendencies. All the theorem guarantees
is that the sequence HAS a limit point. Thus, "good clusters" in X will result by applying the
1-NP rule to the final LVQ prototypes only if, by chance, these prototypes are good class
representatives. In other words, the LVQ model is not driven by a well specified clustering goal.

2. Learning rate $\alpha$: Different strategies for $\alpha_t$ often produce different results. Moreover, LVQ
seldom terminates unless $\alpha_t \to 0$ (i.e., it is forced to stop because successive iterates are
necessarily close).

3. Termination: LVQ often runs to its iterate limit, and actually passes the optimal (clustering)
solution in terms of minimal apparent label error rate. This is called the "over-training"
phenomenon in the neural network literature.


Another, older, clustering approach that is often associated with LVQ is sequential hard
c-means (SHCM). The updating rule of MacQueen's SHCM algorithm is similar to LVQ 15. In
MacQueen's algorithm the weight vectors are initialized with the first c samples in the data set
X. In other words, $v_{r,0} = x_r$, r = 1, ..., c. Let $q_{r,0} = 1$ for r = 1, ..., c ($q_{r,t}$ represents the number of
samples that have so far been used to update $v_{r,t}$). Suppose $x_{t+1}$ is a new sample point such
that $v_{i,t}$ is closest (with respect to, and without loss, the Euclidean metric) to it. MacQueen's
algorithm updates the $v_r$'s as follows (again, index i identifies the winner at this t):

$$ v_{i,t+1} = (v_{i,t}\, q_{i,t} + x_{t+1}) / (q_{i,t} + 1); \tag{11a} $$

$$ q_{i,t+1} = q_{i,t} + 1; \tag{11b} $$

$$ v_{r,t+1} = v_{r,t} \ \text{ for } r \ne i; \tag{11c} $$

$$ q_{r,t+1} = q_{r,t} \ \text{ for } r \ne i. \tag{11d} $$
MacQueen's process terminates when all the samples have been used once (i.e., when t = n). The
sample points are then labeled on the basis of nearness to the final mean vectors (that is, using
the 1-NP rule (4) to find a hard c-partition $U_{SHCM}$). Rearranging (11a), one can rewrite MacQueen's
update equation:

$$ v_{i,t+1} = v_{i,t} + (x_{t+1} - v_{i,t}) / q_{i,t+1}. \tag{12} $$


Writing $1/q_{i,t+1}$ as $\alpha_{i,t+1}$, equation (12) takes exactly the same form as equation (7). However,
there are some differences between LVQ and MacQueen's algorithm: (i) In LVQ sample points are
used repeatedly until termination is achieved, while in MacQueen's method sample points are
used only once (other variants of this algorithm pass through the data set many times 16); (ii)
In MacQueen's algorithm $\alpha_{i,t}$ is inversely proportional to the number of points found
closest to $v_{i,t}$, so it is possible to have $\alpha_{i,t_1} > \alpha_{j,t_2}$ when $t_1 > t_2$. This is not possible in LVQ,
where the learning rate depends only on t and does not increase with it.
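
MacQueen's single-pass procedure (11a)-(11d), written in the incremental form (12), is sketched below. This is an illustration under the stated initialization (the first c samples) and the Euclidean metric; the function name `shcm` is an assumption.

```python
import numpy as np

def shcm(X, c):
    """One pass of MacQueen's sequential hard c-means, equations (11a)-(11d)."""
    X = np.asarray(X, dtype=float)
    V = X[:c].copy()                  # prototypes start at the first c samples
    q = np.ones(c)                    # q[r] = number of samples absorbed by v_r so far
    for x in X[c:]:
        i = np.argmin(np.linalg.norm(x - V, axis=1))   # Euclidean winner
        q[i] += 1                                      # (11b)
        V[i] += (x - V[i]) / q[i]                      # (12), algebraically equal to (11a)
        # non-winners keep their values and counts, as in (11c) and (11d)
    return V, q

X = [[0, 0], [10, 10], [2, 0], [10, 12], [1, 3]]
V, q = shcm(X, c=2)
print(V)   # each row is the running mean of the samples that prototype has won
print(q)
```

Because the effective learning rate is $1/q_{i,t+1}$, every prototype is at all times the sample mean of the points it has won so far, which is the sense in which the procedure is a sequential c-means.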

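MacQueen attempted to partition feature space $\Re^p$ into c subregions, say $(S_1, \ldots, S_c)$, in such a
way as to minimize the functional

$$ \sum_{i=1}^{c} \int_{S_i} \|x - \hat{v}_i\|^2 f(x)\, dx , $$

where f is a density function as in LVQ, and $\hat{v}_i$ is the (conditional) mean of the pdf $f_i$
obtained by restricting f to $S_i$, normalized in the usual way, i.e., $f_i(x) = f(x)|_{S_i} / P(S_i)$; and
$V = (v_1, v_2, \ldots, v_c) \in \Re^{cp}$. Let $V_t = (v_{1,t}, \ldots, v_{c,t})$; $S(V_t) = (S_1(V_t), \ldots, S_c(V_t))$ be the minimum distance
partition relative to $V_t$; $P(S_j) = \text{prob}(x \in S_j)$, $P_{j,t} = P(S_j(V_t)) = \text{prob}(x \in S_j(V_t))$; and $\hat{v}_{j,t}$, the
conditional mean of x over $S_j(V_t)$, is $\hat{v}_{j,t} = \int_{S_j} x\, dF(x) / P(S_j)$ when $P(S_j) > 0$, or $\hat{v}_{j,t} = v_{j,t}$
when $P(S_j) = 0$ (with $S_j = S_j(V_t)$). MacQueen proved that for the algorithm described by equations (11a-d),

$$ \lim_{n \to \infty} \left\{ \frac{ \sum_{t=1}^{n} \sum_{j=1}^{c} P_{j,t}\, \| v_{j,t} - \hat{v}_{j,t} \| }{ n } \right\} = 0 . $$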

Since $\{\hat{v}_j\}$ are conditional means, the partition obtained by applying the nearest prototype
labeling method at (4) to them may not always be desirable from the point of view of
clustering. Moreover, this result does not eliminate the possibility of slow but indefinite
oscillation of the centroids (limit cycles).

LVQ and SHCM suffer from a common problem that can be quite serious. Suppose the input
data $X = \{x_1, x_2, x_3, x_4, x_5, x_6\} \subset \Re^2$ contains the two classes $A = \{x_1, x_2, x_3\}$ and $B = \{x_4, x_5, x_6\}$ as
shown in Figure 7. The initial positions of the centroids $v_{1,0}$ and $v_{2,0}$ are also depicted in
Figure 7. Since the initial centroid for class 2 ($v_{2,0}$) is closer to the remaining four input
points than $v_{1,0}$, each of them will update (modify) $v_2$ only; $v_1$ will not be changed on the first
pass through the data. Moreover, both update schemes result in the updated centroid being
pulled towards the data point some distance along the line joining the two points.
Consequently, the chance for $v_{1,0}$ to get updated on succeeding passes is very low. Although
this results in a locally optimal solution, it is hardly a desirable one.
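
The initialization problem can be reproduced numerically. The sketch below is an illustration with made-up coordinates (the actual point positions of Figure 7 are not available here): one prototype starts far outside the convex hull of the data, the other starts between the two groups, and winner-only updates with an assumed rate $\alpha_t = 0.5/t$ are applied for 50 passes.

```python
import numpy as np

# Hypothetical data: two groups of three points each (not the values of Figure 7).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
V = np.array([[20.0, 20.0],     # v_1 starts far outside the convex hull of X
              [3.0, 3.0]])      # v_2 starts between the two groups
v1_start = V[0].copy()

for t in range(1, 51):
    alpha = 0.5 / t                                     # assumed learning-rate schedule
    for x in X:
        i = np.argmin(np.linalg.norm(x - V, axis=1))    # winner-only competition
        V[i] += alpha * (x - V[i])                      # LVQ update rule (7)

print("v_1 unchanged:", np.array_equal(V[0], v1_start))
print("v_2 final position:", V[1])
```

Every update keeps $v_2$ a convex combination of its old position and a data point, so it remains closer to all six points than $v_1$ ever is. As a result $v_1$ never wins, and a single prototype ends up representing both groups.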

Figure 7. An initialization problem for LVQ/SHCM

There are two causes for this problem: (i) an improper choice of the initial centroids, and (ii)
each input updates only the winner node. To circumvent problem (i), initialization of the $v_i$'s
is often done with random input vectors; this reduces the probability of occurrence of the above
situation, but does not eliminate it. Bezdek et al. 17 attempted to solve problem (ii) by updating
the winner and some of its neighbors (not topological, but metrical neighbors in $\Re^p$) with
each input in FLVQ. In their approach, the learning coefficient was reduced both with time and
distance from the winner. FLVQ, in turn, raised two general issues: defining an appropriate
neighborhood system, and deciding on strategies to reduce the learning coefficient with
distance from the winner node. These two issues motivated the development of the GLVQ
algorithm.


We conclude this section with a brief description of the SOFM scheme, again using t to stand for
iterate number (or time). In this algorithm each prototype $v_{r,t} \in \Re^p$ is associated with a
display node, say $d_{r,t} \in \Re^2$. The vector $v_{i,t}$ that best matches (in the sense of minimum
Euclidean distance in the feature space) an incoming input vector $x_k$ is then identified as in
(4). $v_{i,t}$ has an "image" $d_{i,t}$ in display space. Next, a topological (spatial) neighborhood $\mathcal{N}(d_{i,t})$
centered at $d_{i,t}$ is defined in display space, and its display node neighbors are located. Finally,
the vector $v_{i,t}$ and other prototype vectors in the inverse image $[\mathcal{N}(d_{i,t})]^{-1}$ of spatial
neighborhood $\mathcal{N}(d_{i,t})$ are updated using a generalized form of update rule (7):

$$ v_{r,t} = v_{r,t-1} + \alpha_{rk,t} (x_k - v_{r,t-1}), \qquad d_{r,t} \in \mathcal{N}(d_{i,t}). \tag{13} $$

The function $\alpha_{rk,t}$ defines a learning rate distribution on indices (r) of the nodes to be updated
for each input vector $x_k$ at each iterate t. These numbers impose (by their definition) a sense of
the strength of interaction between (output) nodes. If the $\{v_{r,t}\}$ are initialized with random
values and the external inputs $x_k = x_k(t)$ are drawn from a time invariant probability density
function f(x), then the point density function of $v_{r,t}$ (the number of $v_{r,t}$'s in the ball $B(x_k, \varepsilon)$
centered at the point $x_k$ with radius $\varepsilon$) tends to approximate f(x). It has also been shown that
the $v_{r,t}$'s attain their values in an "orderly fashion" according to f(x) 12. This process is
continued until the weight vectors "stabilize." In this method then, a learning rate distribution
over time and spatial neighborhoods must be defined which decreases with time in order to
force termination (to make $\alpha_{rk,t} = 0$). The update neighborhood also decreases with time. While
this is clearly not a clustering strategy, the central tendency property of the prototypes often
tempts users to assume that terminal weight vectors offer compact representation to clusters of
feature vectors; in practice, this is often false.
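
A single SOFM update of the form (13) can be sketched as follows. The paper leaves the neighborhood and the learning rate distribution to the user, so this illustration makes two explicit assumptions: the neighborhood is a Euclidean ball of a given `radius` around the winner's display node, and the learning rate is a constant `alpha` for every node inside it. The name `sofm_step` is hypothetical.

```python
import numpy as np

def sofm_step(x, V, D, alpha, radius):
    """One update of rule (13): move the winner and its display-space neighbors.

    V : (c, p) float array of prototypes, modified in place
    D : (c, 2) array of fixed display-node coordinates
    """
    i = np.argmin(np.linalg.norm(x - V, axis=1))      # best match in feature space
    lattice_dist = np.linalg.norm(D - D[i], axis=1)   # distances measured in display space
    hood = lattice_dist <= radius                     # nodes whose display image is near d_i
    V[hood] += alpha * (x - V[hood])                  # same rate for all nodes in the neighborhood
    return i, hood

V = np.zeros((3, 2))                                  # three prototypes in the plane
D = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])    # a one-dimensional display lattice
winner, hood = sofm_step(np.array([1.0, 1.0]), V, D, alpha=0.5, radius=1.0)
print(winner, hood)
print(V)
```

Note that the neighborhood is determined by positions in display space, not by distances between prototypes in feature space. In a full run both `alpha` and `radius` would be decreased over time, as the text describes.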
3. GENERALIZED LEARNING VECTOR QUANTIZATION (GLVQ)


In this section we describe a new clustering algorithm which avoids or fixes several of the
limitations mentioned earlier. The learning rules are derived from an optimization problem.
Let $x \in \Re^p$ be a stochastic input vector distributed according to a time invariant probability
distribution f(x), and let i be the best matching node as in (6). Let $L_x$ be a loss function which
measures the locally weighted mismatch (error) of x with respect to the winner:

$$ L_x = L(x; v_1, \ldots, v_c) = \sum_{r=1}^{c} g_r \|x - v_r\|^2 , \ \text{ where} \tag{14a} $$

$$ g_r = \begin{cases} 1 & \text{if } r = i \\ \left( \sum_{j=1}^{c} \|x - v_j\|^2 \right)^{-1} & \text{otherwise.} \end{cases} \tag{14b} $$

Let $X = \{x_1, \ldots, x_n, \ldots\}$ be a set of samples from f(x) drawn at time instants t = 1, 2, ..., n, .... Our
objective is to find a set of c $v_r$'s, say $V = \{v_r\}$, such that the locally weighted error functional $L_x$
defined with respect to the winner $v_i$ is minimized over X. In other words, we seek to

$$ \text{Minimize: } \ \Gamma(V) = \int \cdots \int_{\Re^p} \sum_{r=1}^{c} g_r \|x - v_r\|^2 f(x)\, dx. \tag{15} $$

For a fixed set of points $X = \{x_1, \ldots, x_n\}$ the problem reduces to the unconstrained optimization
problem:

$$ \text{Minimize: } \ \Gamma(V) = \frac{1}{n} \sum_{t=1}^{n} \sum_{r=1}^{c} g_r \|x_t - v_r\|^2. \tag{16} $$

Here $L_x$ is a random functional for each realization of x, and $\Gamma(V)$ is its expectation. Hence
exact optimization of $\Gamma$ using ordinary gradient descent is difficult. We have seen that i, the
index for the winner, is a function of x and all of the $v_r$'s. The function $L_x$ is well defined. If we
assume that x has a unique distance from each $v_r$, then i and g are uniquely determined, and
hence $L_x$ is also uniquely determined. However, if the above assumptions are not met, then i
and g will have discontinuities. In the following discussion we assume that g does not have
discontinuities so that the gradient of $L_x$ exists. As most learning algorithms do 18, we
approximate the gradient of $\Gamma(V)$ by the gradient of the sample function $L_x$. In other words, we
attempt to minimize $\Gamma$ by local gradient descent search using the sample function $L_x$. It is our
conjecture that the optimal values of the $v_r$'s can be approximated in an iterative, stepwise
fashion by moving in the direction of the gradient of $L_x$. The algorithm is derived as follows (for
notational simplicity the subscript for x will be ignored). First rewrite L as:

$$ L = \sum_{r=1}^{c} g_r \|x - v_r\|^2 = \|x - v_i\|^2 + \frac{\sum_{r \ne i} \|x - v_r\|^2}{\sum_{j=1}^{c} \|x - v_j\|^2} = \|x - v_i\|^2 + 1 - \frac{\|x - v_i\|^2}{\sum_{j=1}^{c} \|x - v_j\|^2}. \tag{17} $$

Differentiating L with respect to $v_i$ yields (after some algebraic manipulations):

$$ \nabla_{v_i} L = -2 (x - v_i)\, \frac{D^2 - D + \|x - v_i\|^2}{D^2}, \tag{18} $$

where $D = \sum_{r=1}^{c} \|x - v_r\|^2$. On the other hand, differentiation of L with respect to $v_j$ ($j \ne i$) yields:

$$ \nabla_{v_j} L = -2 (x - v_j)\, \frac{\|x - v_i\|^2}{D^2}. \tag{19} $$

Update rules based on (18) and (19) are:

$$ v_{i,t} = v_{i,t-1} + \alpha_t (x - v_{i,t-1})\, \frac{D^2 - D + \|x - v_{i,t-1}\|^2}{D^2} \quad \text{for the winner node i, and} \tag{20} $$

$$ v_{j,t} = v_{j,t-1} + \alpha_t (x - v_{j,t-1})\, \frac{\|x - v_{i,t-1}\|^2}{D^2} \quad \text{for the other (c-1) nodes, } j \ne i. \tag{21} $$
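
The "algebraic manipulations" behind (18) and (19) are short enough to spell out. The following worked derivation is added for illustration. It introduces the shorthand $a = \|x - v_i\|^2$ (an assumption of this note, not notation from the paper) and uses $D = \sum_r \|x - v_r\|^2$, so that (17) reads $L = a + 1 - a/D$.

```latex
\begin{aligned}
&\text{Winner } v_i:\quad
  \nabla_{v_i} a = -2(x - v_i), \qquad \nabla_{v_i} D = -2(x - v_i), \\
&\nabla_{v_i} L
  = \nabla_{v_i} a - \frac{D\,\nabla_{v_i} a - a\,\nabla_{v_i} D}{D^{2}}
  = \nabla_{v_i} a \left( 1 - \frac{D - a}{D^{2}} \right)
  = -2(x - v_i)\,\frac{D^{2} - D + a}{D^{2}} ; \\[4pt]
&\text{Non-winner } v_j,\ j \ne i:\quad
  \nabla_{v_j} a = 0, \qquad \nabla_{v_j} D = -2(x - v_j), \\
&\nabla_{v_j} L
  = \frac{a\,\nabla_{v_j} D}{D^{2}}
  = -2(x - v_j)\,\frac{a}{D^{2}} .
\end{aligned}
```

Moving each prototype against its gradient, with the constant factor 2 absorbed into the learning rate, gives the winner and non-winner update rules.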
To avoid possible oscillations of the solution, the amount of correction should be reduced as
iteration proceeds. Moreover, like optimization techniques using subgradient descent search,
as one moves closer to an optimum the amount of correction should be reduced (in fact, $\alpha_t$
should satisfy the following two conditions: as $t \to \infty$, $\alpha_t \to 0$ and $\sum \alpha_t \to \infty$) 19. On the other
hand, in the presence of noise, under a suitable assumption about subgradients, the search
becomes successful if the conditions in (10) are satisfied. We recommend a decreasing sequence
of $\alpha_t$ ($0 < \alpha_t < 1$) satisfying (10), which ensures that $\alpha_t$ is reduced neither too fast nor too slowly.
From the point of view of learning, the system should be stable enough to remember old
learned patterns, and yet plastic enough to learn new patterns (Grossberg calls it the
stability-plasticity dilemma) 20. Condition (10a) enables plasticity, while (10b) enforces stability. In
other words, an incoming input should not affect the parameters of a learning system too
strongly, thereby enabling it to remember old learned patterns (stability); at the same time,
the system should be responsive enough to recognize any new trend in the input (plasticity).
Hence, $\alpha_t$ can be taken as $\alpha_0 (1 - t/T)$, where T is the maximum number of iterations the learning
process is allowed to execute and $\alpha_0$ is the initial value of the learning parameter. Referring to
(21), we see that when the match is perfect then nonwinner nodes are not updated; in other
words, this strategy then reduces to LVQ. On the other hand, as the match between x and the
winner node $v_i$ decreases, the impact on other (nonwinner) nodes increases. This seems to be
an intuitively desirable property. We summarize the GLVQ algorithm as follows:


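GLVQ Clustering Algorithm

GLVQ1. Given unlabeled data set $X = \{x_1, x_2, \ldots, x_n\} \subset \Re^p$. Fix c, T, and $\varepsilon > 0$.

GLVQ2. Initialize $V_0 = (v_{1,0}, \ldots, v_{c,0}) \in \Re^{cp}$, and learning rate $\alpha_0 \in (0, 1)$.

GLVQ3. For t = 1, 2, ..., T:

      a. Compute $\alpha_t = \alpha_0 (1 - t/T)$.

      While $k \le n$

         b. Find $\|x_k - v_{i,t-1}\| = \min_{1 \le j \le c} \{ \|x_k - v_{j,t-1}\| \}$.

         c. Update all (c) weight vectors $\{v_{r,t}\}$ with

            $$ v_{i,t} = v_{i,t-1} + \alpha_t (x_k - v_{i,t-1})\, \frac{D^2 - D + \|x_k - v_{i,t-1}\|^2}{D^2} , $$

            $$ v_{r,t} = v_{r,t-1} + \alpha_t (x_k - v_{r,t-1})\, \frac{\|x_k - v_{i,t-1}\|^2}{D^2} \quad (r \ne i), \qquad D = \sum_{r=1}^{c} \|x_k - v_{r,t-1}\|^2 . $$

      Wend

      d. Compute $E_t = \|V_t - V_{t-1}\|_1 = \sum_{r=1}^{c} \|v_{r,t} - v_{r,t-1}\|_1 = \sum_{r=1}^{c} \sum_{k=1}^{p} |v_{rk,t} - v_{rk,t-1}|$.

      e. If $E_t \le \varepsilon$ stop; Else

      Next t.

GLVQ4. Compute non-iteratively the nearest prototype GLVQ c-partition of X:

      $$ u_{GLVQ,ik} = \begin{cases} 1; & \|x_k - v_{i,t}\| \le \|x_k - v_{j,t}\|, \ 1 \le j \le c, \ j \ne i \\ 0; & \text{otherwise} \end{cases} \qquad 1 \le i \le c \text{ and } 1 \le k \le n. $$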


Comments on GLVQ : 

1. There is no need to choose an update neighborhood.

2. Reduction of the learning coefficient with distance (either topological or in $\Re^p$) from the
winner node is not required. Instead, reduction is done automatically and adaptively by the
learning rules.


3. For each input vector, either all nodes get updated or no node does. When there is a perfect
match to the winner node, no node is updated. In this case GLVQ reduces to LVQ.

4. The greater the mismatch to the winner (i.e., the higher the quantization error), the greater
the impact on the weight vectors associated with other nodes. Quantization error is the error in
representing a set of input vectors by a prototype - in the above case the weight vector
associated with the winner node.

5. The learning process attempts to minimize a well-defined objective function. 

6. Our termination strategy is based on small successive changes in the cluster centers. This 
method of algorithmic control offers the best set of centroids for compact representation 
(quantization) of the data in each cluster. 
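
The GLVQ update for one input vector, and the outer loop of step GLVQ3, can be sketched as follows. This is a hedged illustration rather than the authors' code. The function names, default parameters and the guard for D = 0 are assumptions, and the 1-NP labeling of GLVQ4 is omitted for brevity.

```python
import numpy as np

def glvq_step(x, V, alpha):
    """Apply the winner and non-winner update rules for one input x; returns a new array."""
    x = np.asarray(x, dtype=float)
    V = np.asarray(V, dtype=float)
    d2 = np.sum((x - V) ** 2, axis=1)         # squared distance from x to every prototype
    i = int(np.argmin(d2))                    # index of the winner
    D = d2.sum()
    if D == 0.0:                              # degenerate case: every prototype equals x
        return V.copy()
    w = np.full(len(V), d2[i] / D**2)         # scaling factor for the non-winners
    w[i] = (D**2 - D + d2[i]) / D**2          # scaling factor for the winner
    return V + alpha * w[:, None] * (x - V)

def glvq(X, V0, alpha0=0.5, T=50, eps=1e-4):
    """Outer loop of GLVQ3 with alpha_t = alpha_0 (1 - t/T) and a 1-norm stopping test."""
    X = np.asarray(X, dtype=float)
    V = np.array(V0, dtype=float)
    for t in range(1, T + 1):
        alpha = alpha0 * (1.0 - t / T)
        V_prev = V
        for x in X:
            V = glvq_step(x, V, alpha)
        if np.abs(V - V_prev).sum() <= eps:
            break
    return V

V_new = glvq_step([0.0, 0.0], [[1.0, 0.0], [2.0, 0.0]], alpha=0.5)
print(V_new)
```

In the usage line, the squared distances are 1 and 4, so D = 5; the winner is scaled by 21/25 and the non-winner by 1/25. When x coincides with the winner, both factors multiply a zero correction and no prototype moves, which is the behavior noted in comment 3.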
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4. FUZZY LEARNING VECTOR QUANTIZATION (FLVQ) 


Huntsberger and Ajjimarangsee 1 1 used SOFMs to develop clustering algorithms. Algorithm 1 
In 11 Is the SOFM algorithm with an additional layer of neurons. This additional set of 
neurons does not participate In weight updating. After the self-organizing network terminates, 
the additional layer, for each input, finds the weight vector (prototype) closest to it and assigns 
the input data point to that class. A second algorithm In their paper used the necessary 
conditions for FCM to assign a membership value In [0,1) to each data point. Specifically, 

Huntsberger and Ajjimarangsee suggested fuzzification of LVQ by replacing the learning rates 
$\{\alpha_{ik,t}\}$ usually found in rules such as (7) with fuzzy membership values $\{u_{ik,t}\}$ computed with 
the FCM formula 2 : 

$$\alpha_{ik,t} = u_{ik,t} = \left[ \sum_{j=1}^{c} \left( \frac{D_{ik,t}}{D_{jk,t}} \right)^{\frac{2}{m-1}} \right]^{-1} \tag{22}$$

where $D_{ik,t} = \|x_k - v_{i,t-1}\|$. Numerical results reported in Huntsberger and Ajjimarangsee suggest 
that in many cases their algorithms and standard LVQ produce very similar answers. Their 
scheme was a partial integration of LVQ with FCM that showed some interesting results. 
However, it fell short of realizing a model for LVQ clustering; and no properties regarding 
terminal points or convergence were established. Moreover, since the objective of LVQ is 
to find cluster centroids (prototypes), and hence clusters, there is no need to have a topological 
ordering of the weight vectors. Consequently, the approach taken in 11 seems to mix two 
objectives, feature mapping and clustering, and the overall methodology is difficult to 
interpret in either sense. 
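
The FCM membership formula (22) can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `fcm_memberships`, the use of the Euclidean norm, and the rule for sharing membership when a point coincides with a prototype are assumptions made for illustration, not details taken from the paper.

```python
import math

def fcm_memberships(x, prototypes, m=2.0):
    """Memberships of point x in each cluster, following the form of (22).

    Assumes Euclidean distance and m > 1 (both are illustrative choices).
    """
    dists = [math.dist(x, v) for v in prototypes]
    # Assumed convention: if x sits exactly on one or more prototypes,
    # split the full membership among those prototypes.
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(dists))]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists)
            for d_i in dists]

# Hypothetical data: one point and two prototypes on a line.
u = fcm_memberships((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)], m=2.0)
print(u, sum(u))
```

The memberships of a point always sum to one, and the nearer prototype receives the larger share.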


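Integration of FCM with LVQ can be more fully realized by defining the learning rate for 
Kohonen updating as : 

$$\alpha_{ik,t} = (u_{ik,t})^{m_t} = \left[ \sum_{j=1}^{c} \left( \frac{D_{ik,t}}{D_{jk,t}} \right)^{\frac{2}{m_t-1}} \right]^{-m_t}, \quad \text{where} \tag{23a}$$

$$m_t = m_0 + t\,[(m_f - m_0)/T] = m_0 + t\,\Delta m; \quad m_f, m_0 \ge 1; \quad t = 1, 2, \ldots, T. \tag{23b}$$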

$m_t$ replaces the (fixed) parameter m in (22). This results in three families of Fuzzy LVQ or FLVQ 
algorithms, the cases arising by different treatments of parameter $m_t$. In particular, for 
$t \in \{1, 2, \ldots, T\}$, we have three cases depending on the choice of the initial ($m_0$) and final ($m_f$) 
values of m: 

1. $m_0 > m_f \Rightarrow \{m_t\} \downarrow m_f$ : Descending FLVQ (24a) 

2. $m_0 < m_f \Rightarrow \{m_t\} \uparrow m_f$ : Ascending FLVQ (24b) 

3. $m_0 = m_f \Rightarrow m_t = m_0 = m$ : FLVQ = FCM (24c) 
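
The schedule (23b) and the three cases (24a)-(24c) can be sketched as follows. The function names `m_schedule` and `flvq_family` are hypothetical and chosen only for this illustration.

```python
def m_schedule(m0, mf, T):
    """Return [m_1, ..., m_T] with m_t = m0 + t * (mf - m0) / T, as in (23b)."""
    dm = (mf - m0) / T
    return [m0 + t * dm for t in range(1, T + 1)]

def flvq_family(m0, mf):
    """Name the FLVQ case selected by the initial and final values of m."""
    if m0 > mf:
        return "descending"   # case (24a)
    if m0 < mf:
        return "ascending"    # case (24b)
    return "fcm"              # case (24c): m_t is constant

# Hypothetical settings for a short run.
print(flvq_family(3.0, 1.5), m_schedule(3.0, 1.5, 5))
print(flvq_family(1.5, 3.0), m_schedule(1.5, 3.0, 5))
```

Note that the exponent 2/(m_t - 1) in (23a) is undefined at m_t = 1, so a practical descending schedule would stop at a final value slightly above 1.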


Cases 1 and 3 are discussed at length by Bezdek et al. 17 . Case 2 is fully discussed in Tsao et 
al. 21 . Equation (24c) asserts that when $m_0 = m_f$, FLVQ reverts to FCM; this results from 
defining the learning rates via (23a), and using them in FLVQ3.b below. FLVQ is not a direct 
generalization of LVQ because it does not revert to LVQ in case all of the $u_{ik,t}$'s are either 0 or 1 
(the crisp case). Instead, if $m_0 = m_f = 1$, FCM reverts to HCM, and the HCM update formula, 
which is driven by finding unique winners, as is LVQ, differs from (7). FLVQ is 
perhaps the closest possible link between LVQ and c-Means type algorithms. We provide a 
formal description of FLVQ : 



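Fuzzy LVQ (FLVQ) 

FLVQ1. Given unlabeled data set $X = \{x_1, x_2, \ldots, x_n\} \subset \Re^p$. Fix c, T, $\|\cdot\|_A$ and $\varepsilon > 0$. 

FLVQ2. Initialize $v_0 = (v_{1,0}, \ldots, v_{c,0}) \in \Re^{cp}$. Choose $m_0$, $m_f$. 

FLVQ3. For t = 1, 2, ..., T: 

    a. Compute all (cn) learning rates $\{\alpha_{ik,t}\}$ with (23). 

    b. Update all (c) weight vectors $\{v_{i,t}\}$ with 

       $$v_{i,t} = v_{i,t-1} + \frac{\sum_{k=1}^{n} \alpha_{ik,t}\,(x_k - v_{i,t-1})}{\sum_{s=1}^{n} \alpha_{is,t}}$$

    c. Compute $E_t = \|v_t - v_{t-1}\|^2 = \sum_{i=1}^{c} \|v_{i,t} - v_{i,t-1}\|^2$ 

    d. If $E_t < \varepsilon$ stop; Else next t. 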
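
Steps FLVQ1 to FLVQ3 can be sketched in plain Python as below. This is a hedged illustration rather than the authors' implementation: it assumes the Euclidean norm in place of the general norm of FLVQ1, a squared-change stopping measure, a small floor on distances to avoid division by zero, and schedules that keep $m_t$ strictly above 1. The function name `flvq` and the sample data are hypothetical.

```python
import math

def flvq(X, V0, m0, mf, T, eps=1e-6):
    """Sketch of FLVQ: batch update of all c prototypes at every step t."""
    V = [list(v) for v in V0]
    c, p = len(V), len(V[0])
    for t in range(1, T + 1):
        m_t = m0 + t * (mf - m0) / T          # schedule (23b); must stay > 1
        power = 2.0 / (m_t - 1.0)
        num = [[0.0] * p for _ in range(c)]   # sum_k alpha_ik * (x_k - v_i)
        den = [0.0] * c                       # sum_s alpha_is
        for x in X:
            # Assumed guard: floor distances so ratios stay finite.
            d = [max(math.dist(x, v), 1e-12) for v in V]
            for i in range(c):
                u = 1.0 / sum((d[i] / dj) ** power for dj in d)  # (22), m = m_t
                a = u ** m_t                                     # (23a)
                den[i] += a
                for q in range(p):
                    num[i][q] += a * (x[q] - V[i][q])
        V_new = [[V[i][q] + num[i][q] / den[i] for q in range(p)]
                 for i in range(c)]
        E = sum(math.dist(V_new[i], V[i]) ** 2 for i in range(c))
        V = V_new
        if E < eps:
            break
    return V

# Hypothetical data: two well-separated groups in the plane.
X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(flvq(X, [(1.0, 1.0), (9.0, 9.0)], m0=3.0, mf=1.5, T=50))
```

Because every learning rate is strictly positive, each prototype receives a contribution from every data point, which is the behavior FLVQ3.b describes.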



For fixed c, $\{v_{i,t}\}$ and $m_t$, the learning rates $\alpha_{ik,t} = (u_{ik,t})^{m_t}$ at (23a) satisfy the following : 

$$\alpha_{ik,t} = \left( \frac{\kappa}{\|x_k - v_{i,t-1}\|^{2/(m_t-1)}} \right)^{m_t} \tag{25}$$

where $\kappa$ is a positive constant (independent of i for a given $x_k$). Apparently the contribution of $x_k$ to the next update of the node 
weights is inversely proportional to their distances from it. The "winner" in (25) is the $v_{i,t-1}$ 
closest to $x_k$, and it will be moved further along the line connecting $v_{i,t-1}$ to $x_k$ than any of the 
other weight vectors. Since $\sum_i u_{ik,t} = 1 \Rightarrow \sum_i \alpha_{ik,t} \le 1$, this amounts to distributing partial updates 
across all c nodes for each $x_k \in X$. This is in sharp contrast to LVQ, where only the winner is 
updated for each data point. 

In descending FLVQ (24a), for large values of $m_t$ (near $m_0$), all c nodes are updated with lower 
individual learning rates, and as $m_t \to 1$, more and more of the update is given to the "winner" 
node. In other words, the lateral distribution of learning rates is a function of t, which in the 
descending case "sharpens" at the winner node (for each $x_k$) as $m_t \to 1$. Finally, we note 
again that for fixed $m_t$, FLVQ updates the $\{v_{i,t}\}$ using the conditions that are necessary for 
FCM; each step of FLVQ is one iteration of FCM. 
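
The sharpening effect can be seen numerically with a small sketch. The distances below are hypothetical, and `learning_rates` is an illustrative helper that evaluates the memberships of (22) with m replaced by $m_t$ and then raises them to the power $m_t$ as in (23a).

```python
def learning_rates(dists, m_t):
    """Learning rates for one input, given its distances to the c prototypes.

    Assumes all distances are positive and m_t > 1.
    """
    power = 2.0 / (m_t - 1.0)
    u = [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]
    return [u_i ** m_t for u_i in u]

dists = [1.0, 2.0, 4.0]   # hypothetical distances; node 0 is the winner
for m_t in (5.0, 2.0, 1.2):
    rates = learning_rates(dists, m_t)
    print(m_t, [round(a, 4) for a in rates], round(sum(rates), 4))
```

For the large value of $m_t$ every node receives a small rate, while for $m_t$ close to 1 nearly all of the update goes to the winner; in each case the rates sum to at most one.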

Figure 8. Updating Feature Space Prototypes in FLVQ Clustering Nets. 


Figure 8 illustrates the update geometry of FLVQ; note that every node is (potentially) updated 
at every iteration, and the sum of the learning rates is always less than or equal to one. 


Comments on FLVQ : 

1. There is no need to choose an update neighborhood. 

2. Reduction of the learning coefficient with distance (either topological or in $\Re^p$) from the 
winner node is not required. Instead, reduction is done automatically and adaptively by the 
learning rules. 

3. The greater the mismatch to the winner (i.e., the higher the quantization error), the smaller 
the impact to the weight vectors associated with other nodes (recall (25) and (2c)). This is 
directly opposite to the situation in GLVQ. 

4. The learning process attempts to minimize a well-defined objective function (stepwise). 

5. Our termination strategy is based on small successive changes in the cluster centers. This 
method of algorithmic control offers the best set of centroids for compact representation 
(quantization) of the data in each cluster. 

6. This procedure depends on generation of a fuzzy c-partition of the data, so it is an iterative 
clustering model - indeed, stepwise, it is exactly fuzzy c-means 17 . 


5. IMAGE SEGMENTATION WITH GLVQ AND FLVQ 


In this section we illustrate the (FLVQ and GLVQ) algorithms with image segmentation, which 
can be achieved either by finding spatially compact homogeneous regions in the image; or by 
detecting boundaries of regions, i.e., detecting the edges of each region. We have applied our 
clustering strategies to both paradigms. Image segmentation by clustering raises the important 
issue of feature extraction / selection. Generally, features relevant for identifying compact 
regions are different from those useful for the edge detection approach. 

Feature selection for homogeneous region extraction 

When looking for spatially compact regions, feature vectors should incorporate information 
about the spatial distribution of gray values. For pixel (i,j) of a digital image $F = \{(i,j) \mid 1 \le i \le M;\ 
1 \le j \le N\}$, we define the d-th order neighborhood of (i,j), where d > 0 is an integer, as: 

$$N^d_{i,j} = \{(k,l) \in F\} \text{ such that } (i,j) \notin N^d_{i,j} \text{ and if } (k,l) \in N^d_{i,j} \text{ then } (i,j) \in N^d_{k,l} \tag{26}$$


Several such neighborhoods are depicted in Figure 9, where $N^d_{i,j}$ consists of all pixels marked 
with an index $\le d$. For example, $N^1$ is obtained by taking the four nearest neighbor pixels to 
(i,j). Similarly, $N^2$ is defined by its eight nearest neighbors, and so on. $N^d_{i,j}$ as defined in (26) is 
the standard neighborhood definition for modeling digital images using Gibbs or Markov 
Random Fields. To define feature vectors for segmentation, we extend the definition of a d-th 
order neighborhood at (26) to include the center pixel (i,j): 

$$\hat{N}^d_{i,j} = N^d_{i,j} \cup \{(i,j)\} \tag{27}$$

Figure 9. An Ordered Neighborhood system 

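
One common way to generate ordered neighborhoods of this kind is to rank pixel offsets by squared Euclidean distance from the center and let the d-th order neighborhood contain the d nearest distance levels. The sketch below assumes that ordering (the figure itself is not reproduced here, so this is an assumption); the function name `neighborhood_offsets` is hypothetical. Under this assumption the first and second orders give the four and eight nearest neighbors, and the fifth order together with the center pixel fills a 5x5 window.

```python
def neighborhood_offsets(d, include_center=False):
    """Offsets (di, dj) of the d-th order neighborhood of a pixel.

    Assumed ordering: offsets are grouped by squared distance di^2 + dj^2,
    and order d keeps the d smallest distinct positive levels.
    """
    # Axial offsets alone supply the levels 1, 4, ..., d*d, so the d-th
    # smallest level never exceeds d*d and a (2d+1) x (2d+1) search suffices.
    cand = [(di, dj)
            for di in range(-d, d + 1)
            for dj in range(-d, d + 1)
            if (di, dj) != (0, 0)]
    levels = sorted({di * di + dj * dj for di, dj in cand})
    cutoff = levels[d - 1]
    offsets = [(di, dj) for di, dj in cand if di * di + dj * dj <= cutoff]
    if include_center:
        offsets.append((0, 0))   # extension to include the center, as in (27)
    return offsets

for d in range(1, 6):
    print(d, len(neighborhood_offsets(d)))
```

Each set of offsets is symmetric under negation, which gives the mutual-neighbor property required by (26).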
Next, let L = {1, 2, ..., G} be the set of gray values that can be taken by pixels in the image, and let 
$f(i,j)$ be the intensity at (i,j) in F, that is, $f: F \to L$. We define the collection of gray values of all 
pixels that belong to $\hat{N}^d_{i,j}$ as: 

$$S^d_{i,j} = \{ f(k,l) \mid (k,l) \in \hat{N}^d_{i,j} \} \tag{28}$$

Note that $S^d_{i,j}$ may contain the same gray value more than once. We say two neighborhoods 
$\hat{N}^d_{i,j}$ and $\hat{N}^d_{k,l}$ are homogeneous in case $S^d_{i,j}$ and $S^d_{k,l}$ are identical up to a permutation. 

This assumption is natural and useful as long as the neighborhood size is small. To see this, 
consider two 100x100 neighborhoods that contain 5000 pixels with gray value 1 and 5000 
with value G. Satisfaction of this property gives the impression of two perfectly homogeneous 
regions; but in fact one of these neighborhoods might have all 5000 pixels of each intensity in, 
say, the upper and lower halves of the image, while the other neighborhood has a completely 
random mixture of black and white spots. When the neighborhood size is small, however, 
spatial rearrangement of a few gray values among many more in the entire image will not 
create a much different impression to the human visual system as far as homogeneity of the 
region is concerned. Therefore, for small values of d we can derive features for (i,j) from $S^d_{i,j}$ 
which are relatively independent of permutation of its elements (typically, such features 
might include the mean, standard deviation, etc. of the intensity values in $S^d_{i,j}$). 

Subsequently, these features are arrayed into a pixel vector $x_{ij}$ for each pixel. In this 
investigation, we used the gray values in $S^d_{i,j}$ themselves as the feature vector for pixel (i,j); 
thus, each (i,j) in F (excluding boundaries) is associated with $x_{ij}$ in $\Re^{|S^d_{i,j}|}$. 

Since FLVQ and GLVQ both use distances between feature vectors, we sorted the values in $S^d_{i,j}$ 
to get each $x_{ij}$. Sorting can be done either in ascending or in descending order, but the same 
strategy must be used for all pixels. We remark that an increase in the d-size of the 
neighborhood will obscure finer details in the segmented image; conversely, a very low value 
of d usually results in too many small regions. Experimental investigation suggests that 
$3 \le d \le 5$ provides a reasonable tradeoff between fine and gross structure. 
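
The construction of a sorted pixel vector can be sketched as follows. The function name `pixel_vector`, the 3x3 window, and the tiny test images are hypothetical; the sketch only illustrates why sorting makes two neighborhoods that are permutations of each other map to the same feature vector.

```python
def pixel_vector(image, i, j, offsets, sort=True):
    """Gray values of the neighborhood of (i, j), optionally sorted ascending.

    `image` is a list of rows; `offsets` lists (di, dj) pairs that include
    (0, 0) when the center pixel should be part of the vector.
    """
    rows, cols = len(image), len(image[0])
    values = []
    for di, dj in offsets:
        r, s = i + di, j + dj
        # Boundary pixels are excluded, so reject out-of-range neighbors
        # instead of letting negative indices wrap around.
        if not (0 <= r < rows and 0 <= s < cols):
            raise IndexError("neighborhood extends past the image boundary")
        values.append(image[r][s])
    return sorted(values) if sort else values

WINDOW_3X3 = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]
a = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]   # bright pixel in the center
b = [[9, 1, 1], [1, 1, 1], [1, 1, 1]]   # same values, different arrangement

print(pixel_vector(a, 1, 1, WINDOW_3X3, sort=False))
print(pixel_vector(b, 1, 1, WINDOW_3X3, sort=False))
print(pixel_vector(a, 1, 1, WINDOW_3X3) == pixel_vector(b, 1, 1, WINDOW_3X3))
```

Without sorting, the two vectors differ and their distance is nonzero even though the neighborhoods are homogeneous in the sense defined above; with sorting, the distance is zero.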


Feature selection for edge extraction 

Loosely speaking, edges are regions of abrupt changes in gray values. Therefore, features used 
for extraction of homogeneous regions are not suitable for edge-nonedge classification. For 
this approach, we nominate a feature vector $x_{ij}$ in $\Re^3$ with three components : standard 
deviation, gradient 1 and gradient 2. In other words, each pixel is represented by a 3-tuple $x_{ij}$ 
$= (\sigma(i,j), G1(i,j), G2(i,j))$. The standard deviation is defined on $S^d_{i,j}$ as follows: 

$$\sigma(i,j) = \sqrt{ \frac{1}{|S^d_{i,j}|} \sum_{g \in S^d_{i,j}} (g - \mu_{i,j})^2 } \tag{29}$$

where $\mu_{i,j}$ is the average gray value over $S^d_{i,j}$. Since standard deviation measures variation of 
gray values over the neighborhood, using too large a neighborhood will destroy its utility for 
edge detection. The two gradients are defined as : 

$$G1(i,j) = |f_{i+1,j} - f_{i,j}| + |f_{i,j} - f_{i,j+1}| ; \text{ and} \tag{30}$$

$$G2(i,j) = |f_{i+1,j+1} - f_{i,j}| + |f_{i+1,j} - f_{i,j+1}| . \tag{31}$$


Note that G1 measures intensity changes in the horizontal and vertical directions, while G2 
takes into account diagonal edges; this justifies the use of both G1 and G2. 
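
A three-component edge feature of this kind can be sketched as below. The exact gradient formulas are an assumption here: the sketch uses absolute first differences along the two axes for G1 and Roberts-style diagonal differences for G2, and a population standard deviation over the neighborhood. The function name `edge_features` and the test image are hypothetical.

```python
import math

def edge_features(img, i, j, offsets):
    """Return (sigma, G1, G2) for pixel (i, j).

    Assumed forms: sigma is the population standard deviation over the
    neighborhood given by `offsets`; G1 sums absolute axis-aligned
    differences; G2 sums absolute diagonal differences.
    The caller must keep (i, j) away from the image boundary.
    """
    vals = [img[i + di][j + dj] for di, dj in offsets]
    mu = sum(vals) / len(vals)
    sigma = math.sqrt(sum((g - mu) ** 2 for g in vals) / len(vals))
    g1 = abs(img[i + 1][j] - img[i][j]) + abs(img[i][j] - img[i][j + 1])
    g2 = (abs(img[i + 1][j + 1] - img[i][j])
          + abs(img[i + 1][j] - img[i][j + 1]))
    return (sigma, g1, g2)

WINDOW_3X3 = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]
step = [[0, 0, 9, 9] for _ in range(4)]   # vertical step edge
flat = [[5, 5, 5, 5] for _ in range(4)]   # constant region

print(edge_features(step, 1, 1, WINDOW_3X3))
print(edge_features(flat, 1, 1, WINDOW_3X3))
```

All three components vanish on the constant region and are positive next to the step, so the pixel vectors of edge and non-edge pixels separate well in this toy case.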

Implementation 

FLVQ (ascending strategy) and GLVQ were used for segmentation of the house image depicted in 
Figure 10(a). This image is difficult to segment into homogeneous regions 
because it has some textured portions (the trees) behind the house. For the region extraction 
scheme we used neighborhoods of order d=3 and d=5. The number of classes chosen was c=8. 
The computing protocols used for different runs are summarized in Table 2. 

Table 2. Computing protocols for the segmentations 


Since FLVQ produces fuzzy labels for each pixel vector, the fuzzy label vector is defuzzified 
using the maximum membership rule at (5). Thus, each pixel receives a crisp label 
corresponding to one of the c classes in the segmented image. Coloring of the segmented image 
is done by using c distinct gray values, one for each class. Defuzzification is not required for 
the GLVQ algorithm as it produces hard labels. 
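
The defuzzification and coloring steps can be sketched as follows. The maximum membership rule is implemented as an argmax over each fuzzy label vector; the evenly spaced gray levels in `label_to_gray` are an assumption made for illustration, since the text only requires c distinct gray values. Both function names are hypothetical.

```python
def defuzzify(U):
    """Crisp label for each fuzzy label vector (maximum membership rule).

    U[k] holds the c memberships of pixel vector k; ties go to the
    lowest class index.
    """
    return [max(range(len(u)), key=u.__getitem__) for u in U]

def label_to_gray(label, c, max_gray=255):
    """Map class index 0..c-1 to one of c evenly spaced gray values (assumed scheme)."""
    if c < 2:
        return 0
    return round(label * max_gray / (c - 1))

# Hypothetical fuzzy labels for three pixel vectors and c = 3 classes.
U = [[0.2, 0.7, 0.1], [0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
labels = defuzzify(U)
print(labels, [label_to_gray(lab, 3) for lab in labels])
```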

Figure 10 contains some typical outputs of both FLVQ and GLVQ using the region-based 
segmentation approach. To show the effect of sorting we ran both algorithms with unsorted 
and sorted feature vectors. Figure 10(b) represents the segmented output produced by FLVQ 
with d=3 and unsorted features; while figure 10(c) displays the output under the same 
conditions, but with sorted features. Comparing figures 10(b) and (c) one sees that the noisy 
patches on the roof of the house that appear in Fig. 10(b) are absent in Fig. 10(c). Similar 
occurrences can be found in other portions of the image. This demonstrates that sorted pixel 
vectors seem to afford some noise cleaning ability. Figure 10(d) was produced with FLVQ using 
sorted neighborhoods of size 5. Note that the textured tree areas have been segmented more 
compactly; this illustrates the effect of increasing the neighborhood size. Figures 10 (e) and (f) 
are produced by the GLVQ algorithm with sorted neighborhoods of orders 3 and 5, respectively. 
Comparing figures 10(c) and (e) we find that FLVQ and GLVQ are comparable for the house, but 
GLVQ extracts more compact regions for the tree areas. Another interesting thing to note is 
that for GLVQ with a window of size 5x5, the roof of the house is very nicely segmented with 
sharp inter-region boundaries; this is not true for all other cases using either algorithm. 

We used the same image (Figure 10(a)) to test the edge-based approach. The results produced by 
FLVQ and GLVQ are shown in Figures 11(a) and (b), respectively. Comparing these two figures, 
one can see that both algorithms have extracted the compact regions nicely. A careful 
analysis of the images shows that FLVQ detects more edges than GLVQ. As a result, FLVQ 
produces some noisy edges and GLVQ fails to extract some important edges. To summarize, 
both algorithms produce reasonably good results, but GLVQ has a tendency to produce larger 
compact (homogeneous) areas than FLVQ. It appears that GLVQ is less sensitive to 
noise, which might cause a failure to extract finer details. 


Fig. 11(a) FLVQ (edge/nonedge) Fig. 11(b) GLVQ (edge/nonedge) 



6. CONCLUSIONS 

We have considered the role of and interaction between fuzzy and neural-like models for 
clustering, and have illustrated two generalizations of LVQ with an application in image 
segmentation. Unlike methods that utilize Kohonen's SOFM idea, both algorithms avoid the 
necessity of defining an update neighborhood scheme. Both methods are designed to optimize 
performance goals related to clustering, and both have update rules that allocate and distribute 
learning rates to (possibly) all c nodes at each pass through the data. Ascending and descending 
FLVQ updates all nodes at each pass, and learning rates are related to the fuzzy c-means 
clustering algorithm. This yields automatic control of the learning rate distribution and the 
update neighborhood is effectively all c nodes at each pass through the data. FLVQ can be 
considered a (stepwise) implementation of FCM. GLVQ needs only a specification of the 
learning rate sequence and an initialization of the c prototypes. GLVQ either updates all 
nodes for an input vector, or it does not update any. When an input vector exactly matches the 
winner node, GLVQ reduces to LVQ. Otherwise, all nodes are updated inversely proportionally 
to their distances from the input vector. 